-
-
-
- REGEX(3) REGEX(3)
-
-
- NAME
- regcomp, regexec, regerror, regfree - regular-expression
- library
-
- SYNOPSIS
- #include <sys/types.h>
- #include <regex.h>
-
- int regcomp(regex_t *preg, const char *pattern,
- int cflags);
-
- int regexec(const regex_t *preg, const char *string,
- size_t nmatch, regmatch_t pmatch[], int eflags);
-
- size_t regerror(int errcode, const regex_t *preg,
- char *errbuf, size_t errbuf_size);
-
- void regfree(regex_t *preg);
-
- DESCRIPTION
- These routines implement POSIX 1003.2 regular expressions
- (``RE''s); see re_format(7). Regcomp compiles an RE written
- as a string into an internal form, regexec matches that
- internal form against a string and reports results,
- regerror transforms error codes from either into
- human-readable messages, and regfree frees any dynamically
- allocated storage used by the internal form of an RE.
-
- The header <regex.h> declares two structure types, regex_t
- and regmatch_t, the former for compiled internal forms and
- the latter for match reporting. It also declares the four
- functions, a type regoff_t, and a number of constants with
- names starting with ``REG_''.
-
- Regcomp compiles the regular expression contained in the
- pattern string, subject to the flags in cflags, and places
- the results in the regex_t structure pointed to by preg.
- Cflags is the bitwise OR of zero or more of the following
- flags:
-
- REG_EXTENDED Compile modern (``extended'') REs, rather
- than the obsolete (``basic'') REs that are
- the default.
-
- REG_BASIC This is a synonym for 0, provided as a coun-
- terpart to REG_EXTENDED to improve readabil-
- ity.
-
- REG_NOSPEC Compile with recognition of all special
- characters turned off. All characters are
- thus considered ordinary, so the ``RE'' is a
- literal string. This is an extension,
- compatible with but not specified by POSIX
- 1003.2, and should be used with caution in
- software intended to be portable to other
- systems. REG_EXTENDED and REG_NOSPEC may
- not be used in the same call to regcomp.
-
- REG_ICASE Compile for matching that ignores
- upper/lower case distinctions. See
- re_format(7).
-
- REG_NOSUB Compile for matching that need only report
- success or failure, not what was matched.
-
- REG_NEWLINE Compile for newline-sensitive matching. By
- default, newline is a completely ordinary
- character with no special meaning in either
- REs or strings. With this flag, `[^'
- bracket expressions and `.' never match new-
- line, a `^' anchor matches the null string
- after any newline in the string in addition
- to its normal function, and the `$' anchor
- matches the null string before any newline
- in the string in addition to its normal
- function.
-
- REG_PEND The regular expression ends, not at the
- first NUL, but just before the character
- pointed to by the re_endp member of the
- structure pointed to by preg. The re_endp
- member is of type const char *. This flag
- permits inclusion of NULs in the RE; they
- are considered ordinary characters. This is
- an extension, compatible with but not
- specified by POSIX 1003.2, and should be used
- with caution in software intended to be
- portable to other systems.
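-
- As a rough illustration of how cflags change matching, the
- sketch below compiles a pattern with caller-supplied flags
- and reports only success or failure. The helper name
- matches_with_flags is hypothetical and not part of this
- library. Only the portable flags are exercised, since
- REG_NOSPEC and REG_PEND are extensions.
-
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `text` matches `pattern` under
 * `cflags`, 0 if regexec() reports anything other than success, and
 * -1 if the pattern fails to compile. */
int matches_with_flags(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    /* REG_NOSUB: only success or failure is wanted, not what matched. */
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```
-
- Under these assumptions, `^world$' should fail against the
- two-line string ``hello\nworld'' with REG_EXTENDED alone and
- succeed once REG_NEWLINE is ORed in; likewise `HELLO' should
- match ``hello'' only when REG_ICASE is given.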
-
- When successful, regcomp returns 0 and fills in the
- structure pointed to by preg. One member of that structure
- (other than re_endp) is publicized: re_nsub, of type
- size_t, contains the number of parenthesized
- subexpressions within the RE (except that the value of this
- member is undefined if the REG_NOSUB flag was used). If
- regcomp fails, it returns a non-zero error code; see
- DIAGNOSTICS.
-
- Regexec matches the compiled RE pointed to by preg against
- the string, subject to the flags in eflags, and reports
- results using nmatch, pmatch, and the returned value. The
- RE must have been compiled by a previous invocation of
- regcomp. The compiled form is not altered during
- execution of regexec, so a single compiled RE can be used
- simultaneously by multiple threads.
-
- By default, the NUL-terminated string pointed to by string
- is considered to be the text of an entire line, minus any
- terminating newline. The eflags argument is the bitwise
- OR of zero or more of the following flags:
-
- REG_NOTBOL The first character of the string is not the
- beginning of a line, so the `^' anchor
- should not match before it. This does not
- affect the behavior of newlines under
- REG_NEWLINE.
-
- REG_NOTEOL The NUL terminating the string does not end
- a line, so the `$' anchor should not match
- before it. This does not affect the behav-
- ior of newlines under REG_NEWLINE.
-
- REG_STARTEND The string is considered to start at
- string + pmatch[0].rm_so and to have a
- terminating NUL located at string +
- pmatch[0].rm_eo (there need not actually be
- a NUL at that location), regardless of the
- value of nmatch. See below for the
- definition of pmatch and nmatch. This is an
- extension, compatible with but not specified
- by POSIX 1003.2, and should be used with
- caution in software intended to be portable
- to other systems. Note that a non-zero
- rm_so does not imply REG_NOTBOL;
- REG_STARTEND affects only the location of the
- string, not how it is matched.
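-
- One common use of REG_NOTBOL is scanning a string for
- successive matches: after the first match the search
- resumes in the middle of the line, so `^' must no longer
- match at the resume point. The sketch below shows one way
- this might be written; count_matches is a hypothetical
- name, and extended REs are assumed.
-
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: counts non-overlapping matches of `pattern`
 * in `text`. Returns -1 if the pattern fails to compile. */
int count_matches(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    const char *p = text;
    int count = 0;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* On every call after the first, p is not the beginning of a line,
     * so REG_NOTBOL keeps `^' from matching there. */
    while (regexec(&re, p, 1, &m, p == text ? 0 : REG_NOTBOL) == 0) {
        count++;
        if (m.rm_eo > 0)
            p += m.rm_eo;      /* resume just past this match */
        else if (*p != '\0')
            p++;               /* empty match at p: step forward to make progress */
        else
            break;             /* empty match at the end of the string */
    }
    regfree(&re);
    return count;
}
```
-
- With this sketch, `a' is found three times in ``banana'',
- while `^a' is found only once in ``aaa'' because the later
- calls pass REG_NOTBOL.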
-
- See re_format(7) for a discussion of what is matched in
- situations where an RE or a portion thereof could match
- any of several substrings of string.
-
- Normally, regexec returns 0 for success and the non-zero
- code REG_NOMATCH for failure. Other non-zero error codes
- may be returned in exceptional situations; see
- DIAGNOSTICS.
-
- If REG_NOSUB was specified in the compilation of the RE,
- or if nmatch is 0, regexec ignores the pmatch argument
- (but see below for the case where REG_STARTEND is
- specified). Otherwise, pmatch points to an array of nmatch
- structures of type regmatch_t. Such a structure has at
- least the members rm_so and rm_eo, both of type regoff_t
- (a signed arithmetic type at least as large as an off_t
- and a ssize_t), containing respectively the offset of the
- first character of a substring and the offset of the first
- character after the end of the substring. Offsets are
- measured from the beginning of the string argument given
- to regexec. An empty substring is denoted by equal
- offsets, both indicating the character following the empty
- substring.
-
- The 0th member of the pmatch array is filled in to
- indicate what substring of string was matched by the entire
- RE. Remaining members report what substring was matched
- by parenthesized subexpressions within the RE; member i
- reports subexpression i, with subexpressions counted
- (starting at 1) by the order of their opening parentheses
- in the RE, left to right. Unused entries in the array,
- corresponding either to subexpressions that did not
- participate in the match at all, or to subexpressions that
- do not exist in the RE (that is, i > preg->re_nsub), have
- both rm_so and rm_eo set to -1. If a subexpression
- participated in the match several times, the reported
- substring is the last one it matched. (Note, as an example
- in particular, that when the RE `(b*)+' matches `bbb', the
- parenthesized subexpression matches each of the three `b's
- and then an infinite number of empty strings following the
- last `b', so the reported substring is one of the
- empties.)
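-
- The reporting rules above can be sketched as follows. The
- struct span type, the group_span function, and the
- MAX_GROUPS cap are all names invented for this example;
- extended REs are assumed.
-
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

#define MAX_GROUPS 10   /* arbitrary cap chosen for this sketch */

struct span {
    int  matched;   /* 1 = match, 0 = no match, -1 = bad pattern or group */
    long so;        /* offset of the first character, or -1 if unused */
    long eo;        /* offset just past the last character, or -1 if unused */
};

/* Hypothetical helper: reports the offsets of subexpression `group`
 * (0 means the entire RE) for the first match of `pattern` in `text`. */
struct span group_span(const char *pattern, const char *text, size_t group)
{
    struct span s = { -1, -1, -1 };
    regex_t re;
    regmatch_t pm[MAX_GROUPS];

    assert(pattern != NULL && text != NULL);
    if (group >= MAX_GROUPS || regcomp(&re, pattern, REG_EXTENDED) != 0)
        return s;
    /* Any non-zero regexec() result is reported here as "no match". */
    s.matched = (regexec(&re, text, MAX_GROUPS, pm, 0) == 0);
    if (s.matched) {
        s.so = (long)pm[group].rm_so;
        s.eo = (long)pm[group].rm_eo;
    }
    regfree(&re);
    return s;
}
```
-
- For the RE `(a)|(b)' matched against ``xb'', entry 0 and
- entry 2 should both report offsets 1 and 2, while entry 1
- (which took no part in the match) and any entry beyond
- re_nsub should report -1 for both offsets.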
-
- If REG_STARTEND is specified, pmatch must point to at
- least one regmatch_t (even if nmatch is 0 or REG_NOSUB was
- specified), to hold the input offsets for REG_STARTEND.
- Use for output is still entirely controlled by nmatch; if
- nmatch is 0 or REG_NOSUB was specified, the value of
- pmatch[0] will not be changed by a successful regexec.
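-
- Because REG_STARTEND is an extension, a cautious caller
- might guard its use at compile time. The sketch below
- assumes that REG_STARTEND, where available, is a
- preprocessor macro; match_in_range and its -2 return value
- are hypothetical choices made for this example.
-
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: searches only text[start..end) for `pattern`.
 * Returns 1 on a match, 0 otherwise, -1 on a compile error, and -2 if
 * this <regex.h> does not define REG_STARTEND. */
int match_in_range(const char *pattern, const char *text,
                   size_t start, size_t end)
{
#ifdef REG_STARTEND
    regex_t re;
    regmatch_t pm[1];
    int rc;

    assert(pattern != NULL && text != NULL && start <= end);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    pm[0].rm_so = (regoff_t)start;  /* input: where the string starts */
    pm[0].rm_eo = (regoff_t)end;    /* input: where its NUL is taken to be */
    rc = regexec(&re, text, 1, pm, REG_STARTEND);
    regfree(&re);
    return rc == 0 ? 1 : 0;
#else
    (void)pattern; (void)text; (void)start; (void)end;
    return -2;                      /* extension not available here */
#endif
}
```
-
- Since nmatch is 1 in this sketch, pm[0] serves as both the
- input range and, on success, the output for the whole
- match.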
-
- Regerror maps a non-zero errcode from either regcomp or
- regexec to a human-readable, printable message. If preg
- is non-NULL, the error code should have arisen from use of
- the regex_t pointed to by preg, and if the error code came
- from regcomp, it should have been the result from the most
- recent regcomp using that regex_t. (Regerror may be able
- to supply a more detailed message using information from
- the regex_t.) Regerror places the NUL-terminated message
- into the buffer pointed to by errbuf, limiting the length
- (including the NUL) to at most errbuf_size bytes. If the
- whole message won't fit, as much of it as will fit before
- the terminating NUL is supplied. In any case, the
- returned value is the size of buffer needed to hold the
- whole message (including terminating NUL). If errbuf_size
- is 0, errbuf is ignored but the return value is still
- correct.
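-
- The size-reporting behavior suggests a two-call idiom: ask
- for the required size first, then allocate and fetch the
- message. A sketch, with the hypothetical helper name
- compile_error_message:
-
```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: returns a malloc()ed message explaining why
 * `pattern` failed to compile, or NULL if it compiled cleanly or if
 * malloc() failed. The caller frees the result. */
char *compile_error_message(const char *pattern, int cflags)
{
    regex_t re;
    char *msg;
    size_t need;
    int rc;

    assert(pattern != NULL);
    rc = regcomp(&re, pattern, cflags);
    if (rc == 0) {
        regfree(&re);
        return NULL;
    }
    /* First call: errbuf_size is 0, so errbuf is ignored and only the
     * required size (including the NUL) comes back. */
    need = regerror(rc, &re, NULL, 0);
    msg = malloc(need);
    if (msg != NULL)
        regerror(rc, &re, msg, need);   /* second call fills the buffer */
    /* regfree() is not called on this path: the compile failed. */
    return msg;
}
```
-
- The exact wording of the message is left to the
- implementation, so callers should not parse it.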
-
- If the errcode given to regerror is first ORed with
- REG_ITOA, the ``message'' that results is the printable
- name of the error code, e.g. ``REG_NOMATCH'', rather than
- an explanation thereof. If errcode is REG_ATOI, then preg
- shall be non-NULL and the re_endp member of the structure
- it points to must point to the printable name of an error
- code; in this case, the result in errbuf is the decimal
- digits of the numeric value of the error code (0 if the
- name is not recognized). REG_ITOA and REG_ATOI are
- intended primarily as debugging facilities; they are
- extensions, compatible with but not specified by POSIX
- 1003.2, and should be used with caution in software
- intended to be portable to other systems. Be warned also
- that they are considered experimental and changes are
- possible.
-
- Regfree frees any dynamically-allocated storage associated
- with the compiled RE pointed to by preg. The remaining
- regex_t is no longer a valid compiled RE and the effect of
- supplying it to regexec or regerror is undefined.
-
- None of these functions references global variables except
- for tables of constants; all are safe for use from multi-
- ple threads if the arguments are safe.
-
- IMPLEMENTATION CHOICES
- There are a number of decisions that 1003.2 leaves up to
- the implementor, either by explicitly saying ``undefined''
- or by virtue of them being forbidden by the RE grammar.
- This implementation treats them as follows.
-
- See re_format(7) for a discussion of the definition of
- case-independent matching.
-
- There is no particular limit on the length of REs, except
- insofar as memory is limited. Memory usage is approxi-
- mately linear in RE size, and largely insensitive to RE
- complexity, except for bounded repetitions. See BUGS for
- one short RE using them that will run almost any system
- out of memory.
-
- A backslashed character other than one specifically given
- a magic meaning by 1003.2 (such magic meanings occur only
- in obsolete [``basic''] REs) is taken as an ordinary char-
- acter.
-
- Any unmatched [ is a REG_EBRACK error.
-
- Equivalence classes cannot begin or end bracket-expression
- ranges. The endpoint of one range cannot begin another.
-
- RE_DUP_MAX, the limit on repetition counts in bounded rep-
- etitions, is 255.
-
- A repetition operator (?, *, +, or bounds) cannot follow
- another repetition operator. A repetition operator cannot
- begin an expression or subexpression or follow `^' or `|'.
-
- `|' cannot appear first or last in a (sub)expression or
- after another `|', i.e. an operand of `|' cannot be an
- empty subexpression. An empty parenthesized subexpres-
- sion, `()', is legal and matches an empty (sub)string. An
- empty string is not a legal RE.
-
- A `{' followed by a digit is considered the beginning of
- bounds for a bounded repetition, which must then follow
- the syntax for bounds. A `{' not followed by a digit is
- considered an ordinary character.
-
- `^' and `$' beginning and ending subexpressions in obso-
- lete (``basic'') REs are anchors, not ordinary characters.
-
- SEE ALSO
- grep(1), re_format(7)
-
- POSIX 1003.2, sections 2.8 (Regular Expression Notation)
- and B.5 (C Binding for Regular Expression Matching).
-
- DIAGNOSTICS
- Non-zero error codes from regcomp and regexec include the
- following:
-
- REG_NOMATCH    regexec() failed to match
- REG_BADPAT     invalid regular expression
- REG_ECOLLATE   invalid collating element
- REG_ECTYPE     invalid character class
- REG_EESCAPE    \ applied to unescapable character
- REG_ESUBREG    invalid backreference number
- REG_EBRACK     brackets [ ] not balanced
- REG_EPAREN     parentheses ( ) not balanced
- REG_EBRACE     braces { } not balanced
- REG_BADBR      invalid repetition count(s) in { }
- REG_ERANGE     invalid character range in [ ]
- REG_ESPACE     ran out of memory
- REG_BADRPT     ?, *, or + operand invalid
- REG_EMPTY      empty (sub)expression
- REG_ASSERT     ``can't happen''--you found a bug
- REG_INVARG     invalid argument, e.g. negative-length string
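-
- For logging, a caller may want the symbolic name of a code
- without relying on the REG_ITOA extension. The following is
- one possible sketch; regex_code_name and compile_status are
- hypothetical helpers. REG_EMPTY, REG_ASSERT, and REG_INVARG
- are guarded on the assumption that, where they exist, they
- are preprocessor macros, since not every <regex.h> provides
- them.
-
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns the symbolic name of a regcomp() or
 * regexec() return value. Codes not listed yield "(unknown)". */
const char *regex_code_name(int code)
{
    switch (code) {
    case 0:            return "(success)";
    case REG_NOMATCH:  return "REG_NOMATCH";
    case REG_BADPAT:   return "REG_BADPAT";
    case REG_ECOLLATE: return "REG_ECOLLATE";
    case REG_ECTYPE:   return "REG_ECTYPE";
    case REG_EESCAPE:  return "REG_EESCAPE";
    case REG_ESUBREG:  return "REG_ESUBREG";
    case REG_EBRACK:   return "REG_EBRACK";
    case REG_EPAREN:   return "REG_EPAREN";
    case REG_EBRACE:   return "REG_EBRACE";
    case REG_BADBR:    return "REG_BADBR";
    case REG_ERANGE:   return "REG_ERANGE";
    case REG_ESPACE:   return "REG_ESPACE";
    case REG_BADRPT:   return "REG_BADRPT";
#ifdef REG_EMPTY       /* the next three are not available everywhere */
    case REG_EMPTY:    return "REG_EMPTY";
#endif
#ifdef REG_ASSERT
    case REG_ASSERT:   return "REG_ASSERT";
#endif
#ifdef REG_INVARG
    case REG_INVARG:   return "REG_INVARG";
#endif
    default:           return "(unknown)";
    }
}

/* Hypothetical helper: compiles `pattern` and returns regcomp()'s
 * result code (0 on success), freeing the compiled form if there is one. */
int compile_status(const char *pattern, int cflags)
{
    regex_t re;
    int rc;

    assert(pattern != NULL);
    rc = regcomp(&re, pattern, cflags);
    if (rc == 0)
        regfree(&re);
    return rc;
}
```
-
- A call such as regex_code_name(compile_status(pattern,
- REG_EXTENDED)) then yields a short label for a diagnostic
- line; which code a given malformed RE produces may differ
- between implementations.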
-
- HISTORY
- Originally written by Henry Spencer. Altered for inclu-
- sion in the 4.4BSD distribution.
-
- BUGS
- This is an alpha release with known defects. Please
- report problems.
-
- There is one known functionality bug. The implementation
- of internationalization is incomplete: the locale is
- always assumed to be the default one of 1003.2, and only
- the collating elements etc. of that locale are available.
-
- The back-reference code is subtle and doubts linger about
- its correctness in complex cases.
-
- Regexec performance is poor. This will improve with later
- releases. Nmatch exceeding 0 is expensive; nmatch
- exceeding 1 is worse. Regexec is largely insensitive to RE
- complexity except that back references are massively
- expensive. RE length does matter; in particular, there is a
- strong speed bonus for keeping RE length under about 30
- characters, with most special characters counting roughly
- double.
-
- Regcomp implements bounded repetitions by macro expansion,
- which is costly in time and space if counts are large or
- bounded repetitions are nested. An RE like, say,
- `((((a{1,100}){1,100}){1,100}){1,100}){1,100}' will
- (eventually) run almost any existing machine out of swap
- space.
-
- There are suspected problems with response to obscure
- error conditions. Notably, certain kinds of internal
- overflow, produced only by truly enormous REs or by multi-
- ply nested bounded repetitions, are probably not handled
- well.
-
- Due to a mistake in 1003.2, things like `a)b' are legal
- REs because `)' is a special character only in the pres-
- ence of a previous unmatched `('. This can't be fixed
- until the spec is fixed.
-
- The standard's definition of back references is vague.
- For example, does `a\(\(b\)*\2\)*d' match `abbbd'? Until
- the standard is clarified, behavior in such cases should
- not be relied on.
-
- The implementation of word-boundary matching is a bit of a
- kludge, and bugs may lurk in combinations of word-boundary
- matching and anchoring.
-
- March 20, 1994 7
-
-
-